Credit Card Users Churn Prediction

Problem Statement

Business Context

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas

You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards

Data Description

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?

  • If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance
What is the Average Open to buy?
  • 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.
What is the Average utilization Ratio?
  • The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:
  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1

Please read the instructions carefully before starting the project.

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

  • Blanks '___' are provided in the notebook that needs to be filled with an appropriate code to get the correct result. With every '___' blank, there is a comment that briefly describes what needs to be filled in the blank space.
  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill the code wherever asked by the commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw error.
  • Please run the codes in a sequential manner from the beginning to avoid any unnecessary errors.
  • Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.

Importing necessary libraries

In [247]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn import metrics
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

Loading the dataset

In [195]:
#Read CSV
df = pd.read_csv("BankChurners.csv")

#Remove duplicate index, CLIENTNUM
df.drop("CLIENTNUM", inplace=True, axis=1)

print(df.shape)
df.head()
(10127, 20)
Out[195]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000

We have a pretty sizeable dataset here, with 10127 total rows and 20 columns (after removing CLIENTNUM). I can already tell some of them are going to need some pre-processing. Education_Level and Income_Category are currently categorical but can be made psuedo-numerical; I want to put numbers there to show how much education/income each row has relative to the rest of the dataset.

Data Overview

  • Observations
  • Sanity checks
In [196]:
#Check for any immediate problems with data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB

At least two columns have missing data: Education_Level and Marital_Status. We should also check the columns labeled "object", since those might have missing data in the form of "null strings". The rest seem safe: the data types are int or float and the non-nulls equal the number of total rows.

In [197]:
df.describe().T
Out[197]:
count mean std min 25% 50% 75% max
Customer_Age 10127.0 46.325960 8.016814 26.0 41.000 46.000 52.000 73.000
Dependent_count 10127.0 2.346203 1.298908 0.0 1.000 2.000 3.000 5.000
Months_on_book 10127.0 35.928409 7.986416 13.0 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.0 3.812580 1.554408 1.0 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.0 2.341167 1.010622 0.0 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.0 2.455317 1.106225 0.0 2.000 2.000 3.000 6.000
Credit_Limit 10127.0 8631.953698 9088.776650 1438.3 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.0 1162.814061 814.987335 0.0 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.0 7469.139637 9090.685324 3.0 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.0 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 4404.086304 3397.129254 510.0 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.0 64.858695 23.472570 10.0 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.0 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999

Exploratory Data Analysis (EDA)

Data Pre-processing

In [198]:
#Separate columns into categorical and numerical lists for EDA
cat_columns = ['Attrition_Flag', 'Customer_Age', 'Gender', 'Education_Level', 'Marital_Status', 
              'Income_Category', 'Card_Category']

num_columns = [col for col in df.columns if col not in cat_columns]
In [199]:
for col in cat_columns:
    sns.histplot(df[col])
    plt.xticks(rotation=30)
    plt.show()
In [200]:
df['Income_Category'].value_counts()
Out[200]:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64

Observations

Attrition_Flag: This is the target variable, and we can see that it's imbalanced. We're trying to make a model that will predict customer attrition, but as things are a model that is trying to be as accurate as possible will be biased toward predicting the majority class. We will try some techniques later to resolve this.

Customer Age: Some spikes at the beginning and end suggesting some data is truncated, but these already represent the extremes so I don't see a problem here. Looks like a standard age distribution.

Gender: No surprises here.

Dependent_count: Again, looks fine.

Education_Level: These levels are not in order. I'd like to turn this category into "psuedo-numerical" with integers 1-6 so that they can be arranged in order of increasing education. I'm also a little surprised at the dominance of the "Graduate" category.

Marital_Status: Seems fine to me.

Income_Category: Like Education_Level, I'd like this to be arranged in order of increasing income. I can perhaps replace each value with the median of its limits (e.g. "60K - 80K" --> 70). We also have the "abc" values which are just missing.

Card_Category: Almost all customers are just using the Blue card. I expect this to not matter for most, but maybe the gold/silver/platinum will have different attrition rates.

In the end, we'll one hot encode all the categorical data so we are working with pure numbers.

In [201]:
#Fix the categorical data as mentioned above.

df['flagged'] = df['Attrition_Flag'] == "Attrited Customer"
df['flagged'] = df['flagged'].astype(int)
df = df.drop("Attrition_Flag", axis=1)
In [202]:
#Replace Education_Level with tiered categories. Missing values should be replaced by mode (Graduate).
educations = {
    "Uneducated": 0, "High School": 1, "College": 2,
    "Graduate": 3, "Post-Graduate": 4, "Doctorate": 5
}

df['Education_Level'].fillna(3, inplace=True)
df['Education_Level'].replace(educations, inplace=True)
In [212]:
sns.histplot(df['Education_Level'])
Out[212]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f1b4adce08>
In [203]:
#Replace Income_Category with approximate average income for the category.
#At $120K+ it doesn't really matter how high it is as long as the model knows it's higher than the rest.
incomes = {
    "Less than $40K": 30,
    "$40K - $60K": 50,
    "$60K - $80K": 70,
    "$80K - $120K": 100,
    "$120K +": 140
}

df["Income_Category"].replace(incomes, inplace=True)

#Regarding the "abc" values, I will replace them with the mode (30) since that's much more represented than the others.
df['Income_Category'].replace("abc", 30, inplace=True)
In [213]:
sns.histplot(df['Income_Category'])
Out[213]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f1b4c2bb08>
In [205]:
"""
For Marital Status I'm not going to impute missing values. There is a mode (Married), but Single is nearly as
common. I'm going to do one hot encoding and just leave all the columns as 0 for the ones that are missing. The 
models will just not have a 1 to work with on those rows.
"""

one_hot = pd.get_dummies(df['Marital_Status'])
df = df.drop("Marital_Status", axis=1)
df = df.join(one_hot)
In [224]:
df = df.join(pd.get_dummies(df["Gender"])).drop("Gender", axis=1)
df = df.join(pd.get_dummies(df["Card_Category"])).drop("Card_Category", axis=1)
In [225]:
df
Out[225]:
Customer_Age Dependent_count Education_Level Income_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal ... flagged Divorced Married Single F M Blue Gold Platinum Silver
0 45 3 1 70 39 5 1 3 12691.0 777 ... 0 0 1 0 0 1 1 0 0 0
1 49 5 3 30 44 6 1 2 8256.0 864 ... 0 0 0 1 1 0 1 0 0 0
2 51 3 3 100 36 4 1 0 3418.0 0 ... 0 0 1 0 0 1 1 0 0 0
3 40 4 1 30 34 3 4 1 3313.0 2517 ... 0 0 0 0 1 0 1 0 0 0
4 40 3 0 70 21 5 1 0 4716.0 0 ... 0 0 1 0 0 1 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10122 50 2 3 50 40 3 2 3 4003.0 1851 ... 0 0 0 1 0 1 1 0 0 0
10123 41 2 3 50 25 4 2 3 4277.0 2186 ... 1 1 0 0 0 1 1 0 0 0
10124 44 1 1 30 36 5 3 4 5409.0 0 ... 1 0 1 0 1 0 1 0 0 0
10125 30 2 3 50 36 4 3 3 5281.0 0 ... 1 0 0 0 0 1 1 0 0 0
10126 43 2 3 30 25 6 2 4 10388.0 1961 ... 1 0 1 0 1 0 0 0 0 1

10127 rows × 26 columns

Now it's time to invesigate the numerical variables. We can use "flagged" as a hue to get a hint about their relationship to the target variable.

In [207]:
for col in num_columns:
    sns.boxplot(df, x="flagged", y=col)
    plt.xlabel(col)
    plt.ylabel("")
    plt.show()

Observations:

Dependent_count: A slight difference, with flagged customers having slightly more dependents on average.

Months_on_book: Almost no difference.

Total_Relationship_Count: Flagged customers having slightly fewer products held on average.

Months_Inactive_12_mon: Unsurprising that flagged customers have more inactive months on average.

Contacts_Count_12_mon: Flagged customers do have more contacts with the bank on average.

Credit_Limit: Little difference.

Total_Revolving_Bal: Pretty significant difference; flagged customers have much less revolving balance.

Avg_Open_To_Buy: Nothing. Total_Trans_Amt: Pretty similar except high outliers aren't represented amongst flagged customers.

Total_Trans_Ct: One of the better predictors so far. Flagged customers have far fewer transactions.

Total_Ct_Chng_Q4_Q1: Slight difference. Lower on average for flagged customers.

Avg_Utilization_Ratio: Pretty good test, as well. Most flagged characters have close to zero utilization, with many having actually zero.

Finally, we'll do a quick overview of relationships between features. A pair-plot is oging to be really large but we might be able to glean some insight about the relationships between some of these features.

In [208]:
sns.pairplot(df)
Out[208]:
<seaborn.axisgrid.PairGrid at 0x1f19be31648>

A lot of the categorical variables are nearly perfectly uncorrelated with each other, but the numerical variables do have some connection. Specifically, "Credit Limit" and "Avg_Open_To_Buy" are nearly perfectly correlated. We might do well to remove one of those because they're effectively duplicates. This makes sense to me because Avg_Open_to_Buy is just the amount of the total Credit Limit that is still available to be used, which should be redundant when combined with Total Revolving Balance.

In [156]:
df = df.drop("Avg_Open_To_Buy",axis=1)

The other relationships seem more natural. Credit Limit and Utilization Ratio are negatively correlated. Transaction Count and Transaction Amount are positively correlated (duh), but not nearly as cleanly as you might expect. The amount per transaction might vary a bit. I wonder if it's worth putting in a new column for that ratio.

In [159]:
df['amount_per_transaction'] = df["Total_Trans_Amt"] / df["Total_Trans_Ct"]
sns.boxplot(x=df['flagged'], y=df['amount_per_transaction'])
Out[159]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f199ca0b48>

This is actually kind of curious because both Transaction Amount and Transaction Count were both different for Existing vs. Attrited customers. Yet the ratio of amount/count is close to the same either way.

Questions:

  1. How is the total transaction amount distributed?

Right-skewed, but especially so for Existing Customers.

  1. What is the distribution of the level of education of customers?

Most are at the Graduate level, with some not reaching that level. Post-Graduate education is somewhat rare.

  1. What is the distribution of the level of income of customers?

Most of the customers are relatively poor, with <$40k being the most common category by far.

  1. How does the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?

While the majority of customers are below 1.0 (meaning higher in Q4), having a ratio greater than 1.0 was much more likely to be associated with a non-Attrited customer.

  1. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?

Surprisingly, there was a lot of overlap in this category between the two types of customers. They both had similar numbers of inactive months, but the spread of the distribution was greater for Existing customers (not surprising, since there are more of them).

  1. What are the attributes that have a strong correlation with each other?

Credit Limit and Utilization Ratio are negatively correlated. Transaction Count and Transaction Amount are positively correlated (duh), but not nearly as cleanly as you might expect. The amount per transaction might vary a bit.

Model Building

In [228]:
#Designate target variable as Y, the rest as X.

y = df['flagged']
X = df.drop("flagged", axis=1)
In [229]:
#Split the data into three parts: training, validation, and testing.

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

print(X_train.shape, X_val.shape, X_test.shape)
(6075, 25) (2026, 25) (2026, 25)

Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [230]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf

Model Building with original data

Sample code for model building with original data

In [231]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))

print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9846311475409836
Random forest: 1.0
GBM: 0.8709016393442623
Adaboost: 0.8452868852459017
dtree: 1.0

Validation Performance:

Bagging: 0.8098159509202454
Random forest: 0.8312883435582822
GBM: 0.8588957055214724
Adaboost: 0.8773006134969326
dtree: 0.7883435582822086

We can see that some models tend toward overfitting more than others. Bagging, Random Forest, and Decision trees all have much better performance in test than in validation. GBM and AdaBoost are not overfit, which is a great sign.

Model Building with Oversampled data

We can potentially do better by rebalancing the target data. We will try oversampling and undersampling.

In [241]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [242]:
print("Before Over Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Over Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Over Sampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Over Sampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))

print("After Over Sampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Over Sampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Over Sampling, counts of label 'Yes': 976
Before Over Sampling, counts of label 'No': 5099 

After Over Sampling, counts of label 'Yes': 5099
After Over Sampling, counts of label 'No': 5099 

After Over Sampling, the shape of train_X: (10198, 25)
After Over Sampling, the shape of train_y: (10198,) 

In [243]:
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9976465973720338
Random forest: 1.0
GBM: 0.9807805452049422
Adaboost: 0.961953324181212
dtree: 1.0

Validation Performance:

Bagging: 0.8558282208588958
Random forest: 0.8680981595092024
GBM: 0.9141104294478528
Adaboost: 0.8865030674846626
dtree: 0.8404907975460123

While performance is better overall on the training data, mostly what this has done is caused more overfitting. Our Recall scores are up compared to before, but we can tell that even our GBM and Adaboost models are overfit. It's not too bad, though; I would still take 91% recall from GBM compared to the 85% it was before.

Model Building with Undersampled data

In [244]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [245]:
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9897540983606558
Random forest: 1.0
GBM: 0.9795081967213115
Adaboost: 0.9528688524590164
dtree: 1.0

Validation Performance:

Bagging: 0.9294478527607362
Random forest: 0.9386503067484663
GBM: 0.9570552147239264
Adaboost: 0.9570552147239264
dtree: 0.9141104294478528

This is the best results we've seen so far. Up to 95% recall on the validation performance, and no overfitting on GBM/Adaboost (the others are still overfit, but for the trees that's likely because we haven't pruned them - this can happen in the next step).

HyperparameterTuning

Sample tuning method for Random Forest with Undersampled Data

In [271]:
# defining model
model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": [50,110,25],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}

scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV, on undersampled training data
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

#Check performance of this model on the validation set
model = RandomForestClassifier(**randomized_cv.best_params_, random_state=1)
model.fit(X_train_un,y_train_un)
print(f"Validation Recall: {recall_score(y_val, model.predict(X_val))}")
Best parameters are {'n_estimators': 110, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9293092621664052:
Validation Recall: 0.9171779141104295

Sample tuning method for Gradient Boosting with Undersampled Data

In [269]:
# defining model
model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

#Check performance of this model on the validation set
model = GradientBoostingClassifier(**randomized_cv.best_params_, random_state=1)
model.fit(X_train_un,y_train_un)
print(f"Validation Recall: {recall_score(y_val, model.predict(X_val))}")
Best parameters are {'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05} with CV score=0.9467294610151754:
Validation Recall: 0.950920245398773

Sample tuning method for AdaBoost with Undersampled data

In [270]:
# defining model
model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

#Check performance of this model on the validation set
model = AdaBoostClassifier(**randomized_cv.best_params_, random_state=1)
model.fit(X_train_over,y_train_over)
print(f"Validation Recall: {recall_score(y_val, model.predict(X_val))}")
Best parameters are {'n_estimators': 75, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=1, splitter='best')} with CV score=0.9539204140930171:
Validation Recall: 0.9079754601226994

Model Comparison and Final Model Selection

Of the ones tuned, the best performance belongs to the Gradient Booster Classifier on Undersampled data, with a 95% recall score on the validation set (the training was 94%, so not overfit either). Therefore this is the model I will choose.

Let's see how it performs against the heretofore unseen test data!

Test set final performance

In [277]:
finalmodel = GradientBoostingClassifier(random_state=1)
params = {'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05}

finalmodel.fit(X_train_un,y_train_un)
y_pred = finalmodel.predict(X_test)
print(f"Test Set Recall: {recall_score(y_test, y_pred)}")
Test Set Recall: 0.963076923076923

Looks pretty good! Now let's break down the model for insights.

Business Insights and Conclusions

In [278]:
model_performance_classification_sklearn(finalmodel, X_test, y_test)
confusion_matrix(y_test, y_pred)
Out[278]:
array([[1579,  122],
       [  12,  313]], dtype=int64)

We correctly identified 313 out of 325 (96%) of Attrited customers. There were 122 False Positives, but in this business application there is not as much of a cost here.

When a customer is flagged as a possible attrition candidate, the business's response should be to target that individual with incentives to remain a customer. This could be special deals offered to them, perhaps, or some other form of specialized treatment. Even if we flag a customer as potential attrition, it's not a heavy cost to offer the same treatment. However, not offering these deals and losing a customer as a result can be expensive. This is why we prioritized recall as our scoring metric.

Finally, let's look at which features ended up being the most determinant for the predictions.

In [284]:
feature_names = X_train.columns
importances = finalmodel.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

This case is good evidence to suggest that simpler explanations are better. If you're worried that you're losing a customer, you can look at a whole bunch of stuff like marital status, gender, and education. But next to actual usage data, it's clear they aren't as important. The most important features were just ... how many (and for how much) transactions did the customer engage in for the last 12 months? Turns out, those who had fewer transactions were more at risk of becoming Attrited -- and of course that's the case.